Platform Explorer / Nuxeo Platform 6.0

Extension point categorizers

Documentation

The service accepts the registration of parameterized categorizer implementations:

    <categorizer enabled="true" extractTextFromBinaryAttachments="true"
        factory="org.nuxeo.ecm.platform.categorization.categorizer.LanguageCategorizerFactory"
        minTextLength="100" name="language" precisionThreshold="0.8" property="dc:language">
        <skip>
            <facet name="Folderish"/>
        </skip>
        <mapping>
            <outcome name="german">de</outcome>
            <outcome name="english">en</outcome>
            <outcome name="french">fr</outcome>
        </mapping>
    </categorizer>

The optional attribute 'maxSuggestions' bounds the number of suggestions; it is only useful for multi-valued fields.

The optional attribute 'precisionThreshold' allows the system administrator to trade recall for precision by choosing not to categorize a document if the underlying categorizer implementation does not show enough confidence. The range of accepted values is implementation dependent.

The attribute 'minTextLength' tells the categorizer to make a guess only if the length of the provided text content (in characters) exceeds the given threshold. Categorizers tend to perform poorly on very small documents.
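
The three tuning attributes can be combined on a single categorizer. The fragment below is a minimal sketch, not a shipped contribution: it reuses the factory and model file of the default dc:subjects contribution listed further down this page, while the name "subjects-strict", the numeric values and the shortened mapping are hypothetical and chosen only for illustration.

    <!-- Hypothetical name and values: tune them against your own documents. -->
    <categorizer enabled="true"
        factory="org.nuxeo.ecm.platform.categorization.categorizer.tfidf.TfIdfCategorizerFactory"
        model="models/topics-51-tfidf-65536-model.gz"
        name="subjects-strict" property="dc:subjects"
        maxSuggestions="2" minTextLength="500" precisionThreshold="2.0">
        <!-- maxSuggestions: dc:subjects is multi-valued, so at most 2 values are suggested. -->
        <!-- minTextLength: texts shorter than 500 characters are not categorized. -->
        <!-- precisionThreshold: the meaningful scale depends on the factory in use. -->
        <skip>
            <facet name="Folderish"/>
        </skip>
        <!-- Mapping shortened for the example. -->
        <mapping>
            <outcome name="music">art/music</outcome>
            <outcome name="cinema">art/cinema</outcome>
        </mapping>
    </categorizer>

Since the accepted range of 'precisionThreshold' is implementation dependent, a value such as 2.0 should be validated on a sample of real documents before being deployed.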

The textfield elements list the xs:string or nxs:stringList document properties to use for document classification.
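
To register a categorizer from your own bundle, the categorizer element is wrapped in an extension that targets this service, as the existing contributions below do. The following is a sketch of a complete contribution file under stated assumptions: the component name com.example.categorization.contrib and the categorizer name "language-example" are hypothetical, and how the service handles a name already used by another contribution is not covered here.

    <?xml version="1.0"?>
    <!-- Hypothetical component name: pick one unique to your bundle. -->
    <component name="com.example.categorization.contrib">
        <extension point="categorizers"
            target="org.nuxeo.ecm.platform.categorization.service.DocumentCategorizationService">
            <!-- Hypothetical categorizer name; factory and mapping taken from the example above. -->
            <categorizer enabled="true"
                factory="org.nuxeo.ecm.platform.categorization.categorizer.LanguageCategorizerFactory"
                minTextLength="100" name="language-example" property="dc:language">
                <skip>
                    <facet name="Folderish"/>
                </skip>
                <mapping>
                    <outcome name="english">en</outcome>
                    <outcome name="french">fr</outcome>
                </mapping>
            </categorizer>
        </extension>
    </component>

In a Nuxeo bundle, such a file is typically referenced from the Nuxeo-Component entry of META-INF/MANIFEST.MF so that the runtime loads it at startup.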

Contribution Descriptors

  • Class: org.nuxeo.ecm.platform.categorization.service.CategorizerDescriptor

Existing Contributions

Contributions are listed in the order in which they were registered on this extension point.

  • nuxeo-platform-categorization-subjects-6.0.jar
    <extension point="categorizers" target="org.nuxeo.ecm.platform.categorization.service.DocumentCategorizationService">
    
        <documentation>
          Default categorizer to guess the dc:subjects property of English documents.
        </documentation>
    
       <categorizer enabled="true" factory="org.nuxeo.ecm.platform.categorization.categorizer.tfidf.TfIdfCategorizerFactory" maxSuggestions="3" minTextLength="200" model="models/topics-51-tfidf-65536-model.gz" name="dc:subjects" precisionThreshold="1.5" property="dc:subjects">
         <skip>
          <facet name="Folderish"/>
         </skip>
         <mapping>
           <outcome name="architecture">art/architecture</outcome>
           <outcome name="comics">art/comics</outcome>
           <outcome name="cinema">art/cinema</outcome>
           <outcome name="culture">art/culture</outcome>
           <outcome name="danse">art/danse</outcome>
           <outcome name="art history">art/art history</outcome>
           <outcome name="literature">art/literature</outcome>
           <outcome name="music">art/music</outcome>
           <outcome name="painting">art/paint</outcome>
           <outcome name="photography">art/photography</outcome>
           <outcome name="show">art/show</outcome>
           <outcome name="rights">human sciences/rights</outcome>
           <outcome name="economy">human sciences/economy</outcome>
           <outcome name="geography">human sciences/geography</outcome>
           <outcome name="history">human sciences/history</outcome>
           <outcome name="journalism">human sciences/information</outcome>
           <outcome name="languages">human sciences/languages</outcome>
           <outcome name="phylosophy">human sciences/phylosophy</outcome>
           <outcome name="psychology">human sciences/psychology</outcome>
           <outcome name="sociology">human sciences/sociology</outcome>
           <outcome name="astronomy">sciences/astronomy</outcome>
           <outcome name="biology">sciences/biology</outcome>
           <outcome name="chemistry">sciences/chemistry</outcome>
           <outcome name="information technology">sciences/it</outcome>
           <outcome name="logic">sciences/logic</outcome>
           <outcome name="math">sciences/math</outcome>
           <outcome name="medicine">sciences/medicine</outcome>
           <outcome name="physic">sciences/physic</outcome>
           <outcome name="earthscience">sciences/earthscience</outcome>
           <outcome name="education">society/education</outcome>
           <outcome name="company">society/company</outcome>
           <outcome name="ecology">society/ecology</outcome>
           <outcome name="women">society/women</outcome>
           <outcome name="humanitarian">society/humanitarian</outcome>
           <outcome name="politic">society/politic</outcome>
           <outcome name="religion">society/religion</outcome>
           <outcome name="collection">daily life/collection</outcome>
           <outcome name="gastronomy">daily life/gastronomy</outcome>
           <outcome name="gardening">daily life/gardening</outcome>
           <outcome name="games">daily life/games</outcome>
           <outcome name="video games">daily life/video games</outcome>
           <outcome name="fashion">daily life/fashion</outcome>
           <outcome name="sexuality">daily life/sexuality</outcome>
           <outcome name="sport">daily life/sport</outcome>
           <outcome name="television">daily life/television</outcome>
           <outcome name="tourism">daily life/tourism</outcome>
           <outcome name="astronautic">technology/astronautic</outcome>
           <outcome name="electronic">technology/electronic</outcome>
           <outcome name="energy">technology/energy</outcome>
           <outcome name="industry">technology/industry</outcome>
           <outcome name="it">technology/it</outcome>
           <outcome name="robotic">technology/robotic</outcome>
           <outcome name="transport">technology/transport</outcome>
         </mapping>
    
       </categorizer>
    
      </extension>
  • nuxeo-platform-categorization-language-6.0.jar
    <extension point="categorizers" target="org.nuxeo.ecm.platform.categorization.service.DocumentCategorizationService">
    
        <documentation>
          Default categorizers to guess dc:language.
        </documentation>
    
        <categorizer enabled="true" factory="org.nuxeo.ecm.platform.categorization.categorizer.LanguageCategorizerFactory" name="dc:language" property="dc:language">
         <skip>
          <facet name="Folderish"/>
         </skip>
         <mapping>
           <outcome name="german">de</outcome>
           <outcome name="english">en</outcome>
           <outcome name="french">fr</outcome>
           <outcome name="spanish">sp</outcome>
           <outcome name="italian">it</outcome>
         </mapping>
       </categorizer>
    
      </extension>
  • nuxeo-platform-categorization-coverage-6.0.jar
    <extension point="categorizers" target="org.nuxeo.ecm.platform.categorization.service.DocumentCategorizationService">
    
        <documentation>
          Default categorizers to guess dc:coverage.
        </documentation>
    
        <categorizer enabled="true" factory="org.nuxeo.ecm.platform.categorization.categorizer.tfidf.TfIdfCategorizerFactory" maxSuggestions="1" model="models/countries-30-tfidf-65536-model.gz" name="dc:coverage" precisionThreshold="1.5" property="dc:coverage">
         <skip>
          <facet name="Folderish"/>
         </skip>
         <mapping>
           <outcome name="bangladesh">asia/Bangladesh</outcome>
           <outcome name="brazil">south-america/Brazil</outcome>
           <outcome name="china">asia/China</outcome>
           <outcome name="colombia">south-america/Colombia</outcome>
           <outcome name="congo">africa/Congo_Republic</outcome>
           <outcome name="egypt">africa/Egypt</outcome>
           <outcome name="ethiopia">africa/Ethiopia</outcome>
           <outcome name="france">europe/France</outcome>
           <outcome name="germany">europe/Germany</outcome>
           <outcome name="india">asia/India</outcome>
           <outcome name="indonesia">asia/Indonesia</outcome>
           <outcome name="iran">asia/Iran</outcome>
           <outcome name="italy">europe/Italy</outcome>
           <outcome name="japan">asia/Japan</outcome>
           <outcome name="mexico">north-america/Mexico</outcome>
           <outcome name="nigeria">africa/Nigeria</outcome>
           <outcome name="pakistan">asia/Pakistan</outcome>
           <outcome name="philippines">asia/Philippines</outcome>
           <outcome name="russia">europe/Russian_Federation</outcome>
           <outcome name="south_africa">africa/South_Africa</outcome>
           <outcome name="south_korea">asia/South_Korea</outcome>
           <outcome name="spain">europe/Spain</outcome>
           <outcome name="tanzania">africa/Tanzania</outcome>
           <outcome name="thailand">asia/Thailand</outcome>
           <outcome name="turkey">asia/Turkey</outcome>
           <outcome name="ukraine">europe/Ukraine</outcome>
           <outcome name="united_kingdom">europe/United_Kingdom_of_Great_Britain_N_Ireland</outcome>
           <outcome name="united_states">north-america/United_States_of_America</outcome>
           <outcome name="vietnam">asia/Viet_Nam</outcome>
         </mapping>
       </categorizer>
    
      </extension>